Red Wine Data Exploration by GEO

This report explores a dataset of 1,599 red wines, each described by 11 chemical properties and a quality rating.

Univariate Plots Section

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

A histogram is a natural way to plot the distribution of a quantitative variable. A box plot complements it by showing any outliers and the range in which the middle 50 percent of the values lie.
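As a rough illustration of what the box plot encodes, the sketch below computes the quartiles and flags outliers with the common 1.5 × IQR fence rule. The report does not state which rule its plots use, so the rule, the function name `iqr_outliers`, and the parameter `k` are assumptions. The input is only the first ten fixed.acidity values printed in the `str()` output above, not the full dataset.

```python
import statistics

# First ten fixed.acidity values, copied from the str() output above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]


def iqr_outliers(values, k=1.5):
    """Return (q1, q3, outliers) using the k * IQR fence rule (an assumed convention)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Anything outside the fences is treated as an outlier.
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, outliers


q1, q3, outliers = iqr_outliers(fixed_acidity)
print(f"middle 50%: {q1:.2f} to {q3:.2f}; outliers: {outliers}")
```

With only ten rows the quartiles differ from those of the full data, but the idea is the same: the box spans the first to third quartile, and points beyond the fences are drawn separately.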

Fixed acidity

## [1]  4.6 15.9

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The plot above shows that the distribution of fixed acidity is positively skewed. The box plot shows that the middle 50 percent of the values lie between 7.1 and 9.2. The mean, 8.32, marked in pink, is greater than the median because of the high outliers, which are highlighted in blue.

## [1] 0.12 1.58

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile acidity looks bimodal. The box plot shows that the middle 50 percent of the values lie between 0.39 and 0.64. The mean and the median are almost the same, with the mean slightly higher.

## [1] 0 1
##  num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The plot above shows that the histogram of citric acid does not follow a normal distribution. The values are spread almost evenly between 0 and 0.5. The box plot shows that the middle 50 percent of the values lie between 0.090 and 0.420. The mean, 0.271, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.

## [1]  0.9 15.5

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The plot above shows that the histogram of residual sugar is skewed to the right, with many outliers. The box plot shows that the middle 50 percent of the values lie between 1.90 and 2.60. The mean, 2.539, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.
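To see how a single high value pulls the mean above the median, here is a small sketch using only the first ten residual.sugar values from the `str()` output above. The variable names are illustrative, and ten rows are not representative of the full 1,599.

```python
import statistics

# First ten residual.sugar values, copied from the str() output above.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1]

mean_all = statistics.mean(residual_sugar)
median_all = statistics.median(residual_sugar)

# Drop the single largest value and recompute both statistics.
largest = max(residual_sugar)
trimmed = [v for v in residual_sugar if v != largest]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"with largest value:    mean={mean_all:.3f}, median={median_all:.3f}")
print(f"without largest value: mean={mean_trimmed:.3f}, median={median_trimmed:.3f}")
```

In this sample the median does not move when the largest value is removed, while the mean drops noticeably, which is the behaviour described for the right-skewed variables in this section.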

## [1] 0.012 0.611

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The histogram of chlorides is also skewed to the right; the outliers have been removed from this plot. The box plot shows that the middle 50 percent of the values lie between 0.07 and 0.09. The mean, 0.087, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.

## [1]  1 72

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The plot above shows that the histogram of free sulfur dioxide is positively skewed, with many outliers. The box plot shows that the middle 50 percent of the values lie between 7 and 21. The mean, 15.87, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.

## [1]   6 289

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The plot above shows that the histogram of total sulfur dioxide is positively skewed, with outliers. The box plot shows that the middle 50 percent of the values lie between 22 and 62. The mean, 46.47, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.

## [1] 0.99007 1.00369

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The plot above shows that the histogram of density is almost symmetric and approximately normal. The box plot shows that the middle 50 percent of the values lie between 0.9956 and 0.9978. Here the mean and the median are almost equal, as expected for a symmetric distribution.

## [1] 2.74 4.01

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The distribution of pH is also almost symmetric. The box plot shows that the middle 50 percent of the values lie between 3.21 and 3.40. The mean and the median are almost equal, as expected for a symmetric distribution.
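One way to put a number on "almost symmetric" is the quartile (Bowley) skewness, which needs only the three quartiles printed in the summaries above. The report itself does not compute this measure, so it is offered here as a complementary sketch; the function name `quartile_skewness` is hypothetical. The measure looks only at the middle 50 percent of the data, so it ignores the long tails that the histograms show.

```python
def quartile_skewness(q1, median, q3):
    """Bowley skewness: 0 for a symmetric middle half, positive for right skew."""
    return (q3 + q1 - 2 * median) / (q3 - q1)


# (1st Qu., Median, 3rd Qu.) copied from the summary output above.
summaries = {
    "fixed.acidity": (7.10, 7.90, 9.20),
    "residual.sugar": (1.900, 2.200, 2.600),
    "density": (0.9956, 0.9968, 0.9978),
    "pH": (3.210, 3.310, 3.400),
}

skew = {name: quartile_skewness(*q) for name, q in summaries.items()}
for name, value in skew.items():
    print(f"{name:15s} {value:+.3f}")
```

Density and pH come out close to zero, consistent with the symmetric histograms, while fixed acidity is clearly positive. Residual sugar is only mildly positive on this measure because most of its skew comes from the outliers in the tail, which the quartiles do not see.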

## [1] 0.33 2.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The distribution of sulphates is right skewed, with some outliers. The box plot shows that the middle 50 percent of the values lie between 0.55 and 0.73. The mean, 0.6581, marked in pink, is greater than the median because of the outliers, which are highlighted in blue.

## [1]  8.4 14.9

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The distribution of alcohol is also right skewed. The box plot shows that the middle 50 percent of the values lie between 9.5 and 11.1. The mean, 10.42, is slightly greater than the median, as the plot shows, because of the outliers.

The plot above shows the distribution of quality as a bar chart.

Univariate Analysis

The dataset has 1,599 rows and 11 numerical variables that may determine the quality of the wine. The quality column is a categorical variable and is converted to a factor. Quality is the feature of interest: the aim is to find which variables affect it. Fixed acidity may raise the quality of the wine, as it adds to the wine's sour taste. Volatile acidity generally degrades wine quality, so its effect should be studied. Alcohol might also affect quality. Many of the distributions are positively skewed because of outliers, which were handled by limiting the range of values shown in the plots.

Bivariate Plots Section

## 'data.frame':    1599 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num  3 3 3 4 3 3 3 5 5 3 ...
##  $ quality_rating      : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 2 2 2 ...
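The report does not show how `quality_rating` is derived from `quality`. The sketch below is one possible binning, written under explicit assumptions: the cutoffs `bad_below` and `good_above` are guesses, and the third level name is truncated in the output above, so `"good"` is only a placeholder. The one thing the output does support is that the first ten wines, whose original scores are 5, 6 and 7, all fall in the second level, "average".

```python
# "bad" and "average" appear in the str() output above.
# The third level name is truncated there; "good" is an assumed placeholder.
LEVELS = ("bad", "average", "good")


def rate_quality(score, bad_below=5, good_above=7):
    """Map a numeric quality score to an ordered rating label.

    bad_below and good_above are assumed cutoffs, not taken from the report.
    """
    if score < bad_below:
        return LEVELS[0]
    if score > good_above:
        return LEVELS[2]
    return LEVELS[1]


# Original quality scores of the first ten wines, from the first str() output.
first_ten_quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]
ratings = [rate_quality(q) for q in first_ten_quality]
print(ratings)
```

Under these assumed cutoffs every one of the ten scores maps to "average", which matches the level codes shown for `quality_rating`.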

The correlation plot above leads to the following findings:

Fixed acidity is positively correlated with density. One possible reason is that dissolved acids add mass to the wine, so density rises with acid content.

Citric acid is positively correlated with fixed acidity. This is expected, since citric acid is part of the fixed acidity content.

Alcohol percentage is negatively correlated with density, whereas fixed acidity is positively correlated with it. Yet both fixed acidity and alcohol show a positive correlation with wine quality. This needs further study to understand what really affects quality.

Volatile acidity is negatively correlated with citric acid and with quality. Wines with more volatile acid tend to be rated lower.

pH is negatively correlated with fixed acidity and citric acid, as expected: pH decreases as acid content increases.

Residual sugar and density show positive correlation as well.

Let’s visualize the above findings through plots to get a better intuition.

## [1] 0.6717034

The scatter plot above shows the positive correlation between fixed acidity and citric acid.
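The bracketed numbers printed before each scatter plot look like Pearson correlation coefficients, although the report does not show the call that produced them. As a sketch of how such a coefficient is computed, the function below (a hypothetical helper named `pearson_r`) implements the textbook formula and applies it to the first ten rows from the `str()` output. The results will not match the values for the full dataset; only the signs are expected to agree.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# First ten rows, copied from the str() output above.
fixed_acidity = [7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5]
citric_acid = [0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36]
ph = [3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35]

r_citric = pearson_r(fixed_acidity, citric_acid)
r_ph = pearson_r(fixed_acidity, ph)
print(f"fixed.acidity vs citric.acid: {r_citric:+.3f}")
print(f"fixed.acidity vs pH:          {r_ph:+.3f}")
```

Even on this tiny sample the signs agree with the findings listed above: fixed acidity rises with citric acid and falls with pH.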

## [1] 0.6680473

The scatter plot above shows that density and fixed acidity are positively correlated. A plausible explanation is that the acids dissolved in the wine add to its mass, so denser wines tend to have a higher acid content.

## [1] -0.4961798

The scatter plot above shows that density and alcohol are negatively correlated. As the density of the wine increases, the alcohol percentage tends to decrease.

## [1] -0.2561309

Fixed.acidity seems to be slightly negatively correlated with volatile acidity.

## [1] 0.3552834

Density is slightly positively correlated with residual sugar. This might be because dissolved sugar adds to the density of the wine.

## [1] 0.02202623

There appears to be almost no correlation between volatile acidity and density.

## [1] -0.6829782

Fixed acidity and pH are negatively correlated. This is expected, because pH decreases as acidity increases.

## [1]  8.4 14.9

The plot above shows that higher alcohol content is associated with a better quality rating, and that the mean alcohol percentage increases with quality.
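The comparison of mean alcohol content across quality levels can be sketched as a simple group-by. The helper `mean_by_group` is hypothetical, and the input is only the first ten rows from the first `str()` output, which are far too few to reproduce the trend seen in the full data.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(values, groups):
    """Return {group: mean of the values in that group}, sorted by group key."""
    buckets = defaultdict(list)
    for value, group in zip(values, groups):
        buckets[group].append(value)
    return {group: mean(vals) for group, vals in sorted(buckets.items())}


# First ten rows, copied from the first str() output above.
alcohol = [9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5]
quality = [5, 5, 5, 6, 5, 5, 5, 7, 7, 5]

alcohol_by_quality = mean_by_group(alcohol, quality)
for level, avg in alcohol_by_quality.items():
    print(f"quality {level}: mean alcohol {avg:.2f}")
```

Applied to all 1,599 rows, the same grouping would give the per-quality means that the plot summarizes.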

## [1] 0 1

The plot above shows that higher citric acid content is associated with a higher quality rating. The mean citric acid level increases with the quality rating.

## [1] 0.99007 1.00369

Quality tends to decrease as density increases, which might be because denser wines contain less alcohol.

## [1] 0.12 1.58

Quality clearly decreases as volatile acidity increases.

## [1] 0.012 0.611

Chlorides also have a negative effect on the quality of the wine.

## [1] 0.33 2.00

Sulphates have a positive effect on the quality of the wine.

## [1]  0.9 15.5

Residual sugar does not seem to have any impact on the quality of the wine.

## [1]  1 72

Free sulfur dioxide does not seem to have any impact on the quality of the wine either.

Bivariate Analysis

Alcohol has the strongest positive impact on the quality of the wine. Fixed acidity and citric acid also have a positive impact. Sulphates have a slight positive impact.

Volatile acidity has the strongest negative impact on the quality of the wine. Chlorides also have some negative impact. Density has a slight negative impact, which might be due to the lower alcohol content of denser wines. Although citric acid is positively correlated with density, the two relate to quality differently: citric acid has a clearly positive effect, while density has a slight negative one.

Overall, the strongest relationship was between alcohol percentage and wine quality.

Multivariate Plots Section

The plot above shows that higher fixed acidity is associated with more average ratings than bad ratings, so fixed acidity seems to have some positive impact on the quality of the wine, along with alcohol.

The plot above shows that, at lower alcohol percentages, higher citric acid content is associated with more average ratings than bad ratings, so citric acid seems to have some positive impact on the quality of the wine, along with alcohol.

The plot above shows that a moderate alcohol level combined with higher citric acid is associated with higher quality.

The plot above shows that higher chloride levels are associated with bad ratings.

Sulphates seem to have a mild positive impact on quality: higher sulphate levels are associated with higher quality ratings.

Residual sugar does not seem to have any effect on the quality of the wine.

Multivariate Analysis

The multivariate analysis showed how citric acid contributes to quality alongside alcohol content.

Were there any interesting or surprising interactions between features?

Chlorides seem to have a negative impact on quality, whereas sulphates seem to have a positive impact. Residual sugar does not affect the quality of the wine.


Final Plots and Summary

Plot One

Description One

This plot shows that alcohol affects the quality of wine in a positive way. As the alcohol percentage increases, the quality of the wine improves.

Plot Two

Description Two

Citric acid also adds to the quality of wine, most likely because of the sour taste it contributes. Citric acid in combination with alcohol is the major factor that determines the quality of the wine.

Plot Three

Description Three

Volatile acidity degrades the quality of the wine and spoils its taste. Thus, the lower the volatile acidity, the better the quality of the wine.


Reflection

Analysing the variables gave a better insight into how each one affects the quality of the wine. Initially it looked as if density and alcohol both had a positive correlation with quality, but further analysis made it clear that only alcohol does. Citric acid, which is correlated with density, does add to the quality of the wine. It was challenging to find which variable actually has a strong relationship with quality. The multivariate plots helped to compare the variables and to determine what affects the quality of wine. Free sulfur dioxide and total sulfur dioxide did not have any effect on quality. Sulphates have some positive effect on the quality of wine, while chlorides have a degrading effect.